Introduction

In this post, I begin an investigation into the Motor Vehicle Collisions - Crashes dataset from NYC Open Data. The purpose of my inquiry is twofold. First, the Vision Zero initiative is an important effort by the city to reduce traffic deaths, and I wanted to get acquainted with some of the data behind it. Second, just to be extra topical and relevant, I am interested in understanding the relationship between traffic collisions and the Congestion Relief Zone introduced in early 2025. Over a few posts, my goal will be to explore these data and build some models to better understand the impact that the Congestion Relief Zone has on vehicle crashes (decreased due to lower car volume? increased due to less slow-moving traffic?).

This post focuses on exploratory data analysis. I will endeavor to get a handle on the main crashes dataset, join it with various spatial attributes (borough, census tract, etc.), and create some visualizations to help me figure out how to tackle the modeling component of the project. My rough plan is to use a version of this Besag-York-Mollié (BYM) Bayesian hierarchical model. In the linked paper, Morris et al. use the BYM model to model motor vehicle crashes involving school children at the census tract level. My thinking is that, if I can get their model to work, I could add a difference-in-differences component to capture changes pre- and post-congestion pricing and estimate some sort of treatment effect. We'll see how far I get with that but, for now, I present some exploratory data analysis along with my thoughts.
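To fix ideas, here is a rough sketch of the kind of model I have in mind (my notation, not necessarily Morris et al.'s): a Poisson BYM model with a difference-in-differences interaction bolted on:

$$
y_i \sim \text{Poisson}(E_i \lambda_i), \qquad
\log \lambda_i = \beta_0 + \beta_1\,\text{zone}_i + \beta_2\,\text{post}_i + \beta_3\,(\text{zone}_i \times \text{post}_i) + \phi_i + \theta_i
$$

where $y_i$ is the crash count in tract $i$, $E_i$ is an exposure offset, $\phi_i$ is a spatially structured (ICAR) effect, $\theta_i$ is unstructured heterogeneity, and $\beta_3$ would be the difference-in-differences treatment effect of interest. Again, this is just a sketch of the direction, not a committed specification.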

The GitHub repository for this project is available here. Also, note that you can expand the code chunks using the button at the top-right of any visualization.

Load Individual-Level Crash Data

In terms of data loading, I'm going to be working with the NYC Motor Vehicle Collisions - Crashes dataset from NYC Open Data, and will also be loading (a) the Central Business District shape (i.e., the congestion pricing zone) and (b) NYC census tract data so I can aggregate the individual crashes to the census tract level. Links to the various data sources are below:

- Motor Vehicle Collisions - Crashes: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data
- MTA Central Business District Geofence: https://data.ny.gov/Transportation/MTA-Central-Business-District-Geofence-Beginning-J/srxy-5nxn/about_data
- 2020 Census Tracts: https://data.cityofnewyork.us/City-Government/2020-Census-Tracts/63ge-mke6/about_data

Interactive Data Table

Below, you’ll find an interactive data table containing a random sample of 10,000 of the 116,097 collision records left after some data cleaning (you can see the cleaning steps in the expandable code chunk).

# Helper variables
cp_initial_date <- as.Date("2025-01-05", format = "%Y-%m-%d")

# Loading Central Business District Shape: https://data.ny.gov/Transportation/MTA-Central-Business-District-Geofence-Beginning-J/srxy-5nxn/about_data
cbd_geofence <- read.csv("data/MTA_Central_Business_District_Geofence__Beginning_June_2024_20250605.csv", stringsAsFactors = FALSE)
cbd_geofence <- st_as_sfc(cbd_geofence$polygon, crs = 4326)

# Loading NYC Census Tracts: https://data.cityofnewyork.us/City-Government/2020-Census-Tracts/63ge-mke6/about_data 
nyc_tracts <- read.csv("data/2020_Census_Tracts_20250606.csv", stringsAsFactors = FALSE) %>%
  rename("geometry" = "the_geom")
nyc_tracts <- st_as_sf(nyc_tracts, wkt = "geometry", crs = 4326)
nyc_tracts <- nyc_tracts %>%
  mutate(
    area_sqm = Shape_Area / 27878400  # 27,878,400 sq ft per sq mile
  ) %>%
  select(
    geometry,
    BoroCT2020,
    BoroCode,
    BoroName,
    area_sqm
  )

# Loading collisions data: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data 
bad_coords <- c(0, NA)  # lat/long values that indicate a missing location

crashes <- read.csv("data/Motor_Vehicle_Collisions_-_Crashes_20250605.csv") %>%
  select(CRASH.DATE,
         LATITUDE,
         LONGITUDE,
         NUMBER.OF.PERSONS.INJURED,
         NUMBER.OF.PERSONS.KILLED,
         NUMBER.OF.PEDESTRIANS.INJURED,
         NUMBER.OF.PEDESTRIANS.KILLED) %>%
  filter(! LATITUDE %in% bad_coords,
         ! LONGITUDE %in% bad_coords) %>%
  rename(
    "date" = "CRASH.DATE",
    "lat" = "LATITUDE",
    "long" = "LONGITUDE",
    "persons_inj" = "NUMBER.OF.PERSONS.INJURED",
    "persons_death" = "NUMBER.OF.PERSONS.KILLED",
    "ped_inj" = "NUMBER.OF.PEDESTRIANS.INJURED",
    "ped_death" = "NUMBER.OF.PEDESTRIANS.KILLED"
  ) %>%
  mutate(
    date = mdy(date)
    ) %>%
  st_as_sf(coords = c("long","lat"), crs = 4326)

# Associating CP Zone, Pre/Post, Treatment, Borough, and Census Tract w/ Observations
crashes <- crashes %>%
  mutate(
    cp_zone = as.integer(lengths(st_intersects(geometry, cbd_geofence)) > 0),
    after_cp = ifelse(date >= cp_initial_date, 1, 0),
    treatment = ifelse(cp_zone == 1 & after_cp == 1, 1, 0)
  ) %>%
  st_join(nyc_tracts, join = st_within) %>%
  filter(! is.na(area_sqm))

# Interactive Data Table
set.seed(50)

crashes_subset <- crashes %>%
  sample_n(10000)

crashes_dt <- crashes_subset %>%
  st_drop_geometry()

datatable(crashes_dt,
          extensions = 'Buttons',
          filter = "top",
          colnames = c(
            "Date",
            "Persons Injured",
            "Persons Killed",
            "Pedestrians Injured",
            "Pedestrians Killed",
            "CP Zone",
            "Before/After CP",
            "Treatment",
            "Census Tract",
            "Borough Code",
            "Borough Name",
            "Census Tract Area"
          ),
          rownames = FALSE,
          options = list(
            autoWidth = TRUE,
            scrollX = TRUE
          ),
          class = 'compact',  # reduces cell padding
          escape = FALSE
) %>%
  formatStyle(
    columns = names(crashes_dt),
    `white-space` = "nowrap",
    `height` = "1.5em",
    `line-height` = "1.5em"
  )

Interactive Map

Next, I created an interactive map of the same sample of 10,000 collisions, along with the congestion pricing zone. This doesn’t tell us a whole lot yet, but you can see that Lower Manhattan is something of a hotspot for crashes. We’ll explore that with more maps later on.

# Create an interactive map:
ind_crashes_map <- leaflet(data = crashes_subset) %>%
  addTiles() %>%
  setView(lng = -73.9, lat = 40.7, zoom = 10) %>%
  addMarkers(clusterOptions = markerClusterOptions(
    maxClusterRadius = 40,
    showCoverageOnHover = TRUE),
    popup = ~paste(
      "Date:", date, "<br>",
      "Persons Injured:", persons_inj, "<br>",
      "Persons Killed:", persons_death)
  ) %>%
  addPolygons(data = cbd_geofence,
              color = "red",
              weight = 1,
              fillOpacity = 0.5) %>%
  addLegend(
    position = "bottomright",
    colors = "red",
    labels = "Within Zone",
    title = "Congestion Pricing Zone",
    opacity = 0.8
  ) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addControl(
  html = '<h4 style="text-align: right;">
            Crash Incidents in NYC<br>
            <span style="display: block; margin-top: 0.4em;">2024-25</span>
          </h4>',
    position = "topright"
  ) %>%
  addResetMapButton()

ind_crashes_map

With these basic visualizations out of the way, I think it would be interesting to begin aggregating the observations to the tract level. The BYM model mentioned above relies on an Intrinsic Conditional Auto-Regressive (ICAR) term that models the count in each discrete geographic area as a function of the surrounding areas. I’ll get into it more in the next post, but TL;DR: we need shapes (census tracts), each with a single measure (count of crashes).
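For reference, the conditional form of the ICAR prior on the spatial term (standard notation, not specific to this project) is:

$$
\phi_i \mid \phi_{-i} \sim \mathcal{N}\!\left(\frac{1}{d_i}\sum_{j \in \mathcal{N}(i)} \phi_j,\; \frac{\sigma^2}{d_i}\right)
$$

where $\mathcal{N}(i)$ is the set of tracts bordering tract $i$ and $d_i$ is its number of neighbors. Each tract's spatial effect is shrunk toward the average of its neighbors' effects, which is exactly why the model wants one measure per shape.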

Aggregate to Census Tract Level

# Aggregating/summarizing data to the census tract level, no time component
crashes_grouped <- crashes %>%
  group_by(BoroCT2020) %>%
  summarize(tot_crashes = n(),
            area = mean(area_sqm)) %>%
  ungroup() %>%
  mutate(
    crashes_per_area = tot_crashes / area
  ) %>%
  st_drop_geometry()

crashes_tracts_overall <- nyc_tracts %>%
  right_join(crashes_grouped, 
             by = "BoroCT2020")

Our first step in the journey of aggregation is to simply aggregate to the census tract level, irrespective of time. That is, sum up the counts of crashes per census tract for the entire range of the data (in this case, 1/1/2024 to 6/1/2025). From there, my first priority was to establish that crash counts tend to cluster. It follows intuitively that census tracts near each other are likely to share characteristics that affect the number of crashes we expect to see: traffic volume, speed limits, road widths. Beyond intuition, though, I want to demonstrate this clustering as justification for the spatial autoregressive models I plan to run.

Map of Crash Count by Tract

First, a simple map of crash counts per square mile for each tract in the city. I used “per square mile” rather than raw counts because bigger census tracts are likely to have more crashes simply due to their size. In the map below, you can pretty easily see some apparent clustering. Further, this clustering tends to occur in places you might expect: busy, road-heavy, traffic-heavy areas like Lower Manhattan, Downtown Brooklyn, and East Harlem/the Bronx. Notably, the Lower Manhattan hotspot lines up pretty closely with the congestion pricing zone. That this area is a hotspot is expected, but good to confirm visually.

# Mapping crashes per square mile across the city
choropleth_overall <- tm_shape(crashes_tracts_overall) + 
  tm_polygons(fill = "crashes_per_area",
              fill.scale = tm_scale_intervals(values = "brewer.yl_or_rd",
                                              style = "jenks"),
              fill.legend = tm_legend(title = "Crashes per Sq. Mile"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

choropleth_overall

For all the visual clarity of the map, it’s always a good idea to let statistics back up what you observe anecdotally.

Moran’s I: Statistical Test for Clustering

There’s an inferential statistic for clustering: Moran’s I, which measures spatial autocorrelation. In other words, it tests whether nearby/bordering areas are more (or less) correlated than a random arrangement of values would suggest. In statistical-inference speak, the null hypothesis is that there is no spatial autocorrelation, and the alternative hypothesis is that areal units near each other have more similar values than chance would predict. I hope I got all that right.
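For completeness, the global Moran's I statistic is:

$$
I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}
$$

where $w_{ij}$ are the spatial weights (row-standardized here, via style "W") and $n$ is the number of tracts. Positive values of $I$ indicate that similar values cluster together; the expected value under the null is $-1/(n-1)$, which is where the small negative "Expected I" in the test output comes from.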

In plainer terms, a statistically significant result from the test is a good sign that there’s clustering. Given the visual examination above, it seems likely that we’ll get one.

I should note here that, as you will see later on, the distribution of crash counts is positively skewed (i.e., has a long right tail, which makes sense given the lower bound at 0). Apparently, Moran’s I is pretty sensitive to this issue, and skewness is likely to result in more false positives. To remedy this, I transformed the crash counts to the square root scale (initially, I tried the log scale, but this produced a left-skewed distribution and cannot handle counts of 0). This transformation was only for this piece of the analysis; unless I say otherwise, I’ll be using raw crash counts moving forward.
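As a quick illustration of why the log scale fails on zero counts (toy numbers, not the actual crash data):

```r
x <- c(0, 1, 4, 9, 100)  # toy right-skewed counts, including a zero

log(x)   # first element is -Inf: zero counts break the log scale
sqrt(x)  # 0 1 2 3 10: defined at zero, still compresses the right tail
```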

# Sqrt transform data to enforce normal distribution for Moran's I
crashes_tracts_overall <- crashes_tracts_overall %>%
  mutate(
    sqrt_crashes_per_area = sqrt(crashes_per_area)
  )

# Get weights matrix
# Identify tracts with no neighbors
neighbors <- poly2nb(crashes_tracts_overall, queen = TRUE)
no_neighbors <- which(card(neighbors) == 0)

# Filter the original spatial dataframe to exclude these tracts
if (length(no_neighbors) > 0) {
  crashes_tracts_overall_clean <- crashes_tracts_overall[-no_neighbors, ]
} else {
  crashes_tracts_overall_clean <- crashes_tracts_overall
}

# Rebuild neighbors and weights on the cleaned spatial data
neighbors_clean <- poly2nb(crashes_tracts_overall_clean, queen = TRUE)
weights_clean <- nb2listw(neighbors_clean, style = "W", zero.policy = TRUE)

# Conducting Moran's I test for global clustering
moran_global <- moran.test(crashes_tracts_overall_clean$sqrt_crashes_per_area, 
                           weights_clean, 
                           zero.policy = TRUE)

# Create a summary data frame
morans_df <- tibble(
  Statistic = c("Moran's I", "Expected I", "Variance", "P-value"),
  Value = c(
    moran_global$estimate["Moran I statistic"],
    moran_global$estimate["Expectation"],
    moran_global$estimate["Variance"],
    moran_global$p.value
  )
)

# Print as a pretty table
knitr::kable(morans_df, digits = 5) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Statistic      Value
Moran’s I      0.59139
Expected I    -0.00043
Variance       0.00015
P-value        0.00000

And hey presto! The p-value of < 0.00001 (it’s actually way, way lower than that, but I couldn’t quickly figure out a good way to show it) indicates strong spatial autocorrelation: roughly speaking, there’s less than a 1-in-100,000 chance we’d see clustering this strong if values were arranged at random. I’m really not that concerned with the perfect interpretation of the test results; this was more about establishing that there is spatial correlation.
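For what it's worth, here's one way to display a tiny p-value instead of letting it round to zero (the magnitude below is made up, purely for illustration):

```r
p <- 2.5e-78  # hypothetical p-value, just for illustration

formatC(p, format = "e", digits = 2)  # scientific notation: "2.50e-78"
format.pval(p, eps = 1e-16)           # reports it as below a threshold instead
```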

Our next step will be to disaggregate the data into discrete time buckets, because I’m interested not only in modeling the spatial autocorrelation between the tracts, but also the impact of the congestion pricing zone implementation on 1/5/2025.

Disaggregate Data to Monthly Crash Counts

Time Series of Monthly Crash Counts by Borough

Check Distributions of Crash Counts

Set Up Data Table for Analysis